This dataset collects information from about 100k medical appointments in Brazil and focuses on whether or not patients show up for their appointment. Each row includes a number of characteristics about the patient. The analysis tackles 2 questions that help explain which characteristics are associated with patients missing their appointments in Brazil, based on the data from 2016.
This analysis will be addressing 2 questions as shown below:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
# Loading the data and printing out a few lines.
#Inspecting data types and looking for instances of missing or possibly errant data.
df = pd.read_csv("noshowappointments-kagglev2-may-2016.csv")
df.info()
Following are the columns found in this dataset.
df.columns
df.describe()
The following columns are to be converted to the datetime data type:
This is what the following columns signify:
df.head(2)
print("The dataframe has {} rows and {} columns".format(df.shape[0],df.shape[1]))
print("The number of duplicate patient ids found are:",df['PatientId'].duplicated().sum())
#print("\n")
print("The number of duplicate appointment ids found are:",df['AppointmentID'].duplicated().sum())
#number of unique non-repeating values for each feature
df[['PatientId','AppointmentID','Age','Neighbourhood']].nunique(dropna=True)
This tells us that the 110,527 total entries belong to only 62,299 unique patients. The remaining 48,228 entries repeat a patient id that has already appeared, which means those appointments were made by returning patients.
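The relationship between total entries, unique patients and repeat entries can be illustrated with a minimal sketch. The toy `PatientId` values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: the PatientId values are invented for illustration only.
toy = pd.DataFrame({"PatientId": [11, 22, 11, 33, 22, 11]})

n_rows = len(toy)                                    # total entries
n_unique = toy["PatientId"].nunique()                # distinct patients
n_repeat_rows = toy["PatientId"].duplicated().sum()  # rows after a patient's first appearance

# Number of appointments per patient, most frequent first.
visits_per_patient = toy["PatientId"].value_counts()

print(n_rows, n_unique, n_repeat_rows)
print(visits_per_patient)
```

By construction, the number of repeat rows equals the total number of rows minus the number of unique patients, which is the same arithmetic as 110,527 - 62,299 = 48,228.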
Drop the following column(s):
df.drop('AppointmentID',axis=1,inplace=True)
print("The dataframe now has {} columns".format(df.shape[1]))
df.head()
Now, the ScheduledDay and AppointmentDay columns have to be converted to date time type from string type.
# raw strings (r'...') keep \d from being read as an invalid escape sequence
df['ScheduledDay_date'] = df['ScheduledDay'].str.extract(r'(\d\d\d\d-\d\d-\d\d)', expand=True)
df['ScheduledDay_time'] = df['ScheduledDay'].str.extract(r'(\d\d:\d\d:\d\d)', expand=True)
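As an alternative to the regular-expression extraction above, the timestamp strings could be parsed in one step with `pd.to_datetime` and then split with the `.dt` accessor. This is only a sketch: the ISO 8601 layout of the toy strings and the column names `Scheduled_date`, `Scheduled_hour` and `Scheduled_weekday` are assumptions made for illustration.

```python
import pandas as pd

# Toy timestamps; the "YYYY-MM-DDTHH:MM:SSZ" layout is an assumption about the raw column.
toy = pd.DataFrame({"ScheduledDay": ["2016-04-29T18:38:08Z", "2016-05-03T07:15:00Z"]})

# utc=True yields timezone-aware timestamps, since the trailing "Z" denotes UTC.
parsed = pd.to_datetime(toy["ScheduledDay"], utc=True)

toy["Scheduled_date"] = parsed.dt.normalize()   # same day at midnight, still a datetime
toy["Scheduled_hour"] = parsed.dt.hour          # hour of day, 0-23
toy["Scheduled_weekday"] = parsed.dt.dayofweek  # Monday=0 ... Sunday=6
print(toy)
```

Parsing once avoids keeping intermediate string columns, at the cost of departing from the column layout used in the rest of this notebook.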
Function used to move columns in the dataframe is taken from https://towardsdatascience.com/reordering-pandas-dataframe-columns-thumbs-down-on-standard-solutions-1ff0bc2941d5
#Function used to move columns in the dataframe
def movecol(df, cols_to_move=[], ref_col='', place='After'):
    cols = df.columns.tolist()
    if place == 'After':
        seg1 = cols[:list(cols).index(ref_col) + 1]
        seg2 = cols_to_move
    if place == 'Before':
        seg1 = cols[:list(cols).index(ref_col)]
        seg2 = cols_to_move + [ref_col]
    seg1 = [i for i in seg1 if i not in seg2]
    seg3 = [i for i in cols if i not in seg1 + seg2]
    return df[seg1 + seg2 + seg3]
df = movecol(df,
cols_to_move=['ScheduledDay_date','ScheduledDay_time'],
ref_col='Gender',
place='After')
df.head()
Now the dataframe looks like this:
# raw strings (r'...') keep \d from being read as an invalid escape sequence
df['AppointmentDay_date'] = df['AppointmentDay'].str.extract(r'(\d\d\d\d-\d\d-\d\d)', expand=True)
df['AppointmentDay_time'] = df['AppointmentDay'].str.extract(r'(\d\d:\d\d:\d\d)', expand=True)
df = movecol(df,
cols_to_move=['AppointmentDay_date','AppointmentDay_time'],
ref_col='Age',
place='Before')
df.head()
Drop the following columns:
df.drop(['ScheduledDay','AppointmentDay'],axis=1,inplace=True)
print("The dataframe now has {} columns".format(df.shape[1]))
Dataframe as of now
df.head()
Inspecting Datatypes:
df['ScheduledDay_date'] = pd.to_datetime(df['ScheduledDay_date'],format='%Y-%m-%d')
#same thing as above df['ScheduledDay_date'] = df['ScheduledDay_date'].dt.date.astype('datetime64')
df['ScheduledDay_time'] = pd.to_datetime(df['ScheduledDay_time'],format='%H:%M:%S')
#df['ScheduledDay_time'] = pd.to_datetime(df['ScheduledDay_time']).strftime("%H:%M:%S")
df['AppointmentDay_date'] = pd.to_datetime(df['AppointmentDay_date'],format='%Y-%m-%d')
df['AppointmentDay_time'] = pd.to_datetime(df['AppointmentDay_time'],format='%H:%M:%S')
df.dtypes
df.head()
df['ScheduledDay_date'].describe()
df['AppointmentDay_time'].describe()
Dropping the AppointmentDay_time column, since only the date is available for the appointment day:
#Drop AppointmentDay_time column
df.drop(['AppointmentDay_time'],axis=1,inplace=True)
print("The dataframe now has {} columns".format(df.shape[1]))
The appointment day and scheduled day columns are now broken down into individual columns (month, day, weekday and hour), and the new columns are moved next to the columns they were derived from:
#Breaking down appointment day and scheduled day columns into individual columns
df['Appt_Month'] = df['AppointmentDay_date'].dt.month
df["App_Day"] = df['AppointmentDay_date'].dt.day
df['Appointment_Weekday'] = df['AppointmentDay_date'].dt.dayofweek
df['Scheduled_hour'] = df['ScheduledDay_time'].dt.hour
df['Scheduled_Weekday'] = df['ScheduledDay_date'].dt.dayofweek
#organizing the columns
df = movecol(df,
cols_to_move=['Scheduled_hour','Scheduled_Weekday'],
ref_col='ScheduledDay_time',
place='After')
df = movecol(df,
cols_to_move=['Appt_Month','App_Day','Appointment_Weekday'],
ref_col='AppointmentDay_date',
place='After')
df.head()
The weekday numbering returned by dt.dayofweek starts the week on Monday, which is denoted by 0, and ends it on Sunday, which is denoted by 6.
df.Scheduled_Weekday.value_counts()
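The weekday numbering can be checked on a few known dates. The sketch below uses three hand-picked dates from May 2016 (not rows from the dataset) and compares `dt.dayofweek` with `dt.day_name()`.

```python
import pandas as pd

# Three hand-picked dates: a Monday, a Saturday and a Sunday in May 2016.
dates = pd.Series(pd.to_datetime(["2016-05-02", "2016-05-07", "2016-05-08"]))

weekday_number = dates.dt.dayofweek  # Monday=0 ... Sunday=6
weekday_name = dates.dt.day_name()   # English day names by default

print(pd.DataFrame({"date": dates, "number": weekday_number, "name": weekday_name}))
```

Using `dt.day_name()` is one way to label plots without hard-coding the list of day names.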
The Age column needs cleaning, as it contains a negative value.
df.Age.describe()
Finding the entry with negative age value:
#finding the entry with negative age value
df.query('Age < 0')
Dropping the negative age value and checking if the operation was successful:
#dropping the negative age value (selected by condition rather than by a hard-coded row label)
df.drop(df[df['Age'] < 0].index, inplace=True)
#checking if the negative age value is dropped
df.query('Age < 0')
#verify age column for valid entries
df.Age.describe()
Cleaning the Handicap Column:
#entries with values 2, 3 and 4 will be removed, as handicap is treated as a yes/no factor: 0 signifies no and 1 signifies yes
df.Handcap.value_counts()
#dropping incorrect entries in handcap column
df.drop(df.loc[(df['Handcap']!=0) & (df['Handcap']!=1)].index, inplace=True)
Checking handcap column for correct entries:
#checking handcap column for correct entries
df.Handcap.value_counts()
Cleaning the Neighbourhood Column:
df.Neighbourhood.value_counts()
The following neighbourhoods will be removed due to insufficient and inaccurate data:
#removing the neighbourhood
df.drop(df.loc[(df['Neighbourhood']=='ILHAS OCEÂNICAS DE TRINDADE')].index, inplace=True)
#removing the neighbourhood
df.drop(df.loc[(df['Neighbourhood']=='PARQUE INDUSTRIAL')].index, inplace=True)
Checking if removal of the neighbourhoods was successful
#checking if removal of the neighbourhoods was successful
df.Neighbourhood.value_counts()
Inspecting the dataframe:
# inspect the dataframe
df.info()
#preview of dataframe
df.head()
Renaming the Columns now
df.rename(columns={
    "PatientId": "Patient_ID",
    "ScheduledDay_date": "Scheduled_Date",
    "ScheduledDay_time": "Scheduled_Time",
    "Scheduled_hour": "Scheduled_Hour",
    "AppointmentDay_date": "Appointment_Date",
    "Appt_Month": "Appointment_Month",
    "App_Day": "Appointment_Day",
    "Hipertension": "Hypertension",
    "Handcap": "Handicap",
    "SMS_received": "SMS_Received",
    "No-show": "Appointment_Missed",
}, inplace=True)
df.head()
Making a copy of the original dataframe, so that a dataframe filtered for only patients who missed their appointments can be created as shown below:
df_new = df.copy()
df_yes = df_new[df_new['Appointment_Missed'] == 'Yes']
print(df_yes.shape)
df_yes.head()
#Gender Factor Analyzed
plt.figure(figsize=(11,8))
sns.set(font_scale=1.35)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
gender_count = sns.countplot(x='Gender', data=df, hue= 'Appointment_Missed' ,palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2);
gender_count.set_xticklabels(["Female","Male"]);
for p in gender_count.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    gender_count.text(txt_x+0.1,txt_y+8,txt)
Based on the plot above, females both attended and missed more appointments than males. For both genders, the number of missed appointments was lower than the number of attended appointments.
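The bar annotations in the plot above are written with a manual loop over the patches. As a hedged alternative, matplotlib 3.4 and later provide `Axes.bar_label`, which places a label on each bar of a container. The sketch below uses plain matplotlib and made-up counts rather than the notebook's seaborn figure; with seaborn, the bar containers are typically reachable through `ax.containers`, which is an assumption to verify against the installed version.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt

# Made-up counts for illustration only.
counts = {"Attended": 30, "Missed": 10}

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(list(counts.keys()), list(counts.values()), edgecolor="black")

# bar_label writes one text label per bar; padding is the gap in points above the bar.
labels = ax.bar_label(bars, fmt="%d", padding=3)

print([label.get_text() for label in labels])
```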
df.groupby(['Gender'])['Appointment_Missed'].value_counts(normalize=True)
#Gender Factor Normalized
sns.set(font_scale=1.2)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
gender = df.groupby(['Gender'])['Appointment_Missed'].value_counts(normalize=True)
gender = gender.mul(100)
gender = gender.rename('percent').reset_index()
gender_graph = sns.catplot(x='Gender',y='percent',hue='Appointment_Missed',kind='bar',data=gender,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
gender_graph.set_xticklabels(["Female","Male"]);
gender_graph.fig.set_size_inches(8,7)
gender_graph.ax.set_ylim(0,100)
for p in gender_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    gender_graph.ax.text(txt_x+0.10,txt_y+0.46,txt)
Because the sample sizes differ between the two genders, the counts were normalized to allow a fair comparison. In the plot above there is no clear trend, so the gender factor is not a good indicator: neither gender has more patients missing their appointments than attending them.
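The normalization step can be sketched with `pd.crosstab` and `normalize="index"`, which divides each row by its own total so that groups of different sizes become comparable. The toy rows below are invented for illustration; only the column names mirror this notebook.

```python
import pandas as pd

# Toy data: 4 female and 2 male appointments, invented for illustration.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M"],
    "Appointment_Missed": ["No", "No", "No", "Yes", "No", "Yes"],
})

# Raw counts depend on group size ...
counts = pd.crosstab(toy["Gender"], toy["Appointment_Missed"])

# ... while row-normalized percentages do not.
rates = pd.crosstab(toy["Gender"], toy["Appointment_Missed"], normalize="index") * 100

print(counts)
print(rates)
```

In this toy example females have as many missed appointments as males in absolute terms (1 each), yet the missed share is 25% for females and 50% for males, which is why the proportions rather than the counts are compared.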
Now the age factor will be analyzed as shown below:
Age Bracket Breakdown:
To keep the analysis of the age factor simple, the age values were divided into 3 categories: Children_Adolescents (ages 0 to 14), Adult (ages 15 to 64) and Senior (ages 65 and above).
As shown in the plot below, the Adult category has the highest counts of both missed and attended appointments compared to the remaining age categories.
# create a list of the conditions
conditions = [
    (df['Age'] <= 14),
    (df['Age'] >= 15) & (df['Age'] <= 64),
    (df['Age'] >= 65)]
# create a list of the values we want to assign for each condition
values = ['Children_Adolescents', 'Adult', 'Senior']
# create a new column and use np.select to assign values to it using our lists as arguments
# a string default is passed because newer NumPy versions may raise a TypeError when string choices are mixed with the integer default 0
df['Age_Group'] = np.select(conditions, values, default='Unknown')
df = movecol(df,
cols_to_move=['Age_Group'],
ref_col='Age',
place='After')
# display updated DataFrame's head view
df.head(4)
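The same three age brackets could also be produced with `pd.cut`, which bins a numeric column by interval edges. This is a sketch of an alternative, not the method used in this notebook; the toy ages are chosen to sit on the bracket boundaries.

```python
import pandas as pd

# Toy ages placed on the bracket boundaries.
ages = pd.Series([0, 14, 15, 64, 65, 102])

# Intervals are right-closed by default: (-1, 14], (14, 64], (64, inf).
groups = pd.cut(
    ages,
    bins=[-1, 14, 64, float("inf")],
    labels=["Children_Adolescents", "Adult", "Senior"],
)

print(pd.DataFrame({"Age": ages, "Age_Group": groups}))
```

`pd.cut` returns an ordered categorical, so plots and group-bys keep the brackets in age order rather than alphabetical order.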
#Age Group Factor Analyzed
plt.figure(figsize=(11,8))
sns.set(font_scale=1.35)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
age_1 = sns.countplot(x='Age_Group', data=df, hue= 'Appointment_Missed' ,palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2);
for p in age_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    age_1.text(txt_x+0.1,txt_y+8,txt)
Code source: https://www.thetopsites.net/article/52692083.shtml
sns.set(font_scale=1.2)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
age = df.groupby(['Age_Group'])['Appointment_Missed'].value_counts(normalize=True)
age = age.mul(100)
age = age.rename('percent').reset_index()
age_graph = sns.catplot(x='Age_Group',y='percent',hue='Appointment_Missed',kind='bar',data=age,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
age_graph.fig.set_size_inches(8,7)
age_graph.ax.set_ylim(0,100)
for p in age_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    age_graph.ax.text(txt_x+0.07,txt_y+0.46,txt)
Because the number of samples differs between age groups, the proportion for each age group was derived as shown above. The new plot depicts the proportion of missed and attended appointments for all three age groups; the Senior group has the highest share of attended appointments at 84.6%, compared to about 80% for the remaining age groups. After normalization, the Children_Adolescents group appears to miss a larger share of its appointments than the other age groups.
To conclude, more patients attended their appointments than missed them in every age group, so the age factor is not a good indicator of whether a patient will show up for a scheduled appointment.
This part analyzes the hour at which the appointment was booked and whether that hour affected the appointment being missed.
#Scheduled Hour Factor Analyzed
plt.figure(figsize=(30,10))
sns.set(font_scale=1.5,style='darkgrid')
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
descending_order = df['Scheduled_Hour'].value_counts().sort_values(ascending=False).index
scheduled_hour_1 = sns.countplot(x='Scheduled_Hour', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2, order=descending_order);
for p in scheduled_hour_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    scheduled_hour_1.text(txt_x-0.01,txt_y+1,txt)
In the plot above, the majority of appointments were booked at 7 am, followed by 8, 9 and 10 am; the fewest appointments, both attended and missed, were booked at 21:00 (9 pm). Because the number of samples differs between scheduled hours, the proportion of missed and attended appointments for each scheduled hour was derived as shown below.
#proportion plot
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
descending_order = df['Scheduled_Hour'].value_counts().sort_values(ascending=False).index
scheduled_hour = df.groupby(['Scheduled_Hour'])['Appointment_Missed'].value_counts(normalize=True)
scheduled_hour = scheduled_hour.mul(100)
scheduled_hour = scheduled_hour.rename('percent').reset_index()
scheduled_hour_graph = sns.catplot(x='Scheduled_Hour',y='percent',hue='Appointment_Missed',kind='bar',data=scheduled_hour,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=3,legend_out=False,order=descending_order)
scheduled_hour_graph.fig.set_size_inches(40,11)
scheduled_hour_graph.ax.set_ylim(0,100)
sns.set(font_scale=2)
#scheduled_hour_graph.set_xticklabels(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
for p in scheduled_hour_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    scheduled_hour_graph.ax.text(txt_x,txt_y+0.4,txt)
To conclude, after normalizing the data above, appointments booked at 7 am seem to have the lowest proportion of missed appointments and the highest proportion of attended ones. Otherwise, the proportions of attended and missed appointments are roughly similar across the scheduled hours shown above, and no hour had more patients missing their appointments than attending them.
Breaking down the scheduled date factor further, I analyzed the weekday on which the appointment was booked and whether this factor affected the appointment being missed.
#Scheduled Weekday Factor Analyzed
plt.figure(figsize=(30,10))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
scheduled_weekday_1 = sns.countplot(x='Scheduled_Weekday', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
scheduled_weekday_1.set_xticklabels(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
for p in scheduled_weekday_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    scheduled_weekday_1.text(txt_x+0.1,txt_y+6.8,txt)
Among the weekdays, Tuesday has the highest number of both missed and attended appointments, followed by Wednesday and Monday, whereas Saturday has the lowest. Because the number of samples differs between scheduled weekdays, the proportion of missed and attended appointments for each weekday was derived as shown below.
#proportion plot
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
scheduled_weekday = df.groupby(['Scheduled_Weekday'])['Appointment_Missed'].value_counts(normalize=True)
scheduled_weekday = scheduled_weekday.mul(100)
scheduled_weekday = scheduled_weekday.rename('percent').reset_index()
scheduled_weekday_graph = sns.catplot(x='Scheduled_Weekday',y='percent',hue='Appointment_Missed',kind='bar',data=scheduled_weekday,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
scheduled_weekday_graph.fig.set_size_inches(20,11)
scheduled_weekday_graph.ax.set_ylim(0,100)
scheduled_weekday_graph.set_xticklabels(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
for p in scheduled_weekday_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    scheduled_weekday_graph.ax.text(txt_x+0.07,txt_y+0.4,txt)
In conclusion, the proportions of attended and missed appointments are very similar across the scheduled weekdays shown above, with the exception of Saturday; because too few samples were collected for that day, no conclusive statement can be made about it. No weekday had more patients missing their appointments than attending them.
For the appointment date factor only the date was available, so this factor was broken down into 2 subfactors: appointment month and weekday.
In this subfactor, the month for which the appointment was booked was analyzed, including whether the month of the appointment had an impact on the appointment being missed.
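The caveat about sparsely sampled days can be made explicit by tabulating the sample size next to the missed percentage for each group. The sketch below uses invented rows, and the threshold `MIN_SAMPLES` is a hypothetical value chosen only to show the mechanics.

```python
import pandas as pd

# Toy data: weekday 0 has 5 rows, weekday 1 has 4 rows, weekday 5 has a single row.
toy = pd.DataFrame({
    "Scheduled_Weekday": [0, 0, 0, 0, 0, 1, 1, 1, 1, 5],
    "Appointment_Missed": ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes"],
})

# True where the appointment was missed; the mean of a boolean column is the missed share.
missed = toy["Appointment_Missed"].eq("Yes")
summary = missed.groupby(toy["Scheduled_Weekday"]).agg(n="count", missed_pct="mean")
summary["missed_pct"] = summary["missed_pct"] * 100

# Hypothetical cut-off: flag groups too small to support a conclusion.
MIN_SAMPLES = 3
summary["enough_data"] = summary["n"] >= MIN_SAMPLES

print(summary)
```

In the toy table, weekday 5 shows a 100% missed share, but the `enough_data` flag makes clear that it rests on a single row.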
#Appointment Date Factor Analyzed
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
appointment_month_1 = sns.countplot(x='Appointment_Month', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
appointment_month_1.set_xticklabels(["April","May","June"]);
for p in appointment_month_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    appointment_month_1.text(txt_x+0.08,txt_y+6.8,txt)
In the count plot above, May has the highest number of both attended and missed appointments, followed by June and April. Because the number of samples differs between months, the proportion of missed and attended appointments for each month was derived as shown below.
#proportion plot
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
appointment_month = df.groupby(['Appointment_Month'])['Appointment_Missed'].value_counts(normalize=True)
appointment_month = appointment_month.mul(100)
appointment_month = appointment_month.rename('percent').reset_index()
appointment_month_graph = sns.catplot(x='Appointment_Month',y='percent',hue='Appointment_Missed',kind='bar',data=appointment_month,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
appointment_month_graph.fig.set_size_inches(13,11)
appointment_month_graph.ax.set_ylim(0,100)
appointment_month_graph.set_xticklabels(["April","May","June"]);
for p in appointment_month_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    appointment_month_graph.ax.text(txt_x+0.09,txt_y+0.4,txt)
In the graph above there is no pattern that supports a sound conclusion; in both plots, the number of missed appointments was lower than the number of attended appointments.
In this subfactor, the weekday on which the appointment fell was analyzed, including whether that weekday had an impact on the appointment being missed.
#Appointment Weekday Factor Analyzed
plt.figure(figsize=(30,10))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
appointment_weekday_1 = sns.countplot(x='Appointment_Weekday', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
appointment_weekday_1.set_xticklabels(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
for p in appointment_weekday_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    appointment_weekday_1.text(txt_x+0.1,txt_y+6.8,txt)
In the count plot above, Tuesday and Wednesday are the busiest days, with similar numbers of attended and missed appointments, followed by Monday, Friday and then Thursday. Because the number of samples differs between weekdays, the proportion of missed and attended appointments for each weekday was derived as shown below.
#proportion plot
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
appointment_weekday = df.groupby(['Appointment_Weekday'])['Appointment_Missed'].value_counts(normalize=True)
appointment_weekday = appointment_weekday.mul(100)
appointment_weekday = appointment_weekday.rename('percent').reset_index()
appointment_weekday_graph = sns.catplot(x='Appointment_Weekday',y='percent',hue='Appointment_Missed',kind='bar',data=appointment_weekday,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
appointment_weekday_graph.fig.set_size_inches(20,11)
appointment_weekday_graph.ax.set_ylim(0,100)
appointment_weekday_graph.set_xticklabels(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
for p in appointment_weekday_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    appointment_weekday_graph.ax.text(txt_x+0.07,txt_y+0.9,txt)
In the graph above there is no pattern that supports an accurate conclusion, though one thing is certain: no weekday had more missed appointments than attended ones.
In the Neighbourhood factor analysis, a count plot was used to check whether any neighbourhood had more missed appointments than attended appointments. As shown below, no neighbourhood has more missed than attended appointments. JARDIM CAMBURI has the highest number of both missed and attended appointments, followed by MARIA ORTIZ. In contrast, the fewest appointments were recorded in the AEROPORTO neighbourhood.
df.groupby('Neighbourhood')['Appointment_Missed'].value_counts().sort_values(ascending=False)
#Neighbourhood Factor Analyzed
plt.figure(figsize=(110,120))
sns.set(font_scale=3,style='darkgrid')
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
descending_order = df['Neighbourhood'].value_counts().sort_values(ascending=False).index
sns.countplot(y='Neighbourhood',hue ='Appointment_Missed', data=df,palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2, order=descending_order);
#proportion plot
plt.figure(figsize=(110,120))
sns.set(font_scale=3)
descending_order = df['Neighbourhood'].value_counts().sort_values(ascending=False).index
#sns.set(font_scale=1.25)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
(df.groupby(['Neighbourhood'])['Appointment_Missed']
.value_counts(normalize=True)
.rename('Proportion')
.reset_index()
.pipe((sns.barplot,"data"), x='Proportion', y='Neighbourhood', hue='Appointment_Missed', palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2, order=descending_order));
Because the neighbourhoods do not have similar numbers of samples, a proportion plot was produced as shown above to make the analysis fairer. In conclusion, both plots show that no neighbourhood has more patients missing their appointments than attending them.
The Scholarship factor indicates whether or not the patient is enrolled in the Brazilian welfare program called Bolsa Família. As shown below in the count plot, patients with a scholarship missed fewer appointments than patients without one. Also, most patients had no scholarship.
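With many neighbourhoods on one axis, the same check can also be done in a table rather than read off the plots. The following sketch uses invented neighbourhood names "A", "B" and "C"; only the column names mirror this notebook.

```python
import pandas as pd

# Toy data with invented neighbourhood names.
toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B", "C"],
    "Appointment_Missed": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# One row per neighbourhood, one column per outcome; absent combinations count as 0.
counts = pd.crosstab(toy["Neighbourhood"], toy["Appointment_Missed"])

# Neighbourhoods where missed appointments outnumber attended ones.
majority_missed = counts.index[counts["Yes"] > counts["No"]].tolist()

print(counts)
print(majority_missed)
```

An empty list from the same comparison on the full dataframe would correspond to the conclusion stated above.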
#Scholarship Factor Analyzed
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
scholarship_1 = sns.countplot(x='Scholarship', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
scholarship_1.set_xticklabels(["No Scholarship","Scholarship: Granted"]);
for p in scholarship_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    scholarship_1.text(txt_x+0.1,txt_y,txt)
#proportion plot
sns.set(font_scale=1.7)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
scholarship = df.groupby(['Scholarship'])['Appointment_Missed'].value_counts(normalize=True)
scholarship = scholarship.mul(100)
scholarship = scholarship.rename('percent').reset_index()
scholarship_graph = sns.catplot(x='Scholarship',y='percent',hue='Appointment_Missed',kind='bar',data=scholarship,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
scholarship_graph.fig.set_size_inches(10,10)
scholarship_graph.ax.set_ylim(0,100)
scholarship_graph.set_xticklabels(["No Scholarship","Scholarship: Granted"]);
for p in scholarship_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    scholarship_graph.ax.text(txt_x+0.1,txt_y+0.8,txt)
Looking at the proportion plot above, if an equal number of samples were present for each scholarship option, patients granted a scholarship would have more missed appointments than patients with no scholarship, though more data would be required to substantiate this observation. Even so, the scholarship factor did not lead to more appointments being missed than attended.
The Hypertension factor indicates whether or not the patient has hypertension. As shown below in the count plot, patients with hypertension missed fewer appointments than patients without hypertension.
#Hypertension Factor Analyzed
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
hypertension_1 = sns.countplot(x='Hypertension', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
hypertension_1.set_xticklabels(["Hypertension Patient: No","Hypertension Patient: Yes"]);
for p in hypertension_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    hypertension_1.text(txt_x+0.1,txt_y+6.8,txt)
#proportion plot
sns.set(font_scale=1.7)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
hypertension = df.groupby(['Hypertension'])['Appointment_Missed'].value_counts(normalize=True)
hypertension = hypertension.mul(100)
hypertension = hypertension.rename('percent').reset_index()
hypertension_graph = sns.catplot(x='Hypertension',y='percent',hue='Appointment_Missed',kind='bar',data=hypertension,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
hypertension_graph.fig.set_size_inches(13,11)
hypertension_graph.ax.set_ylim(0,100)
hypertension_graph.set_xticklabels(["Hypertension Patient: No","Hypertension Patient: Yes"]);
for p in hypertension_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    hypertension_graph.ax.text(txt_x+0.1,txt_y+0.8,txt)
Looking at the proportion graph above, even if an equal number of samples were present for each hypertension category, the hypertension factor does not show more appointments being missed than attended. Patients with hypertension do seem to attend a larger share of their appointments than patients without hypertension.
The Diabetes factor indicates whether or not the patient has diabetes and whether being diabetic increases the chance of missing an appointment. As shown below in the count plot, patients with diabetes missed fewer appointments than patients without diabetes.
#Diabetes Factor Analyzed
plt.figure(figsize=(8,8))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
e = sns.countplot(x='Diabetes', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
e.set_xticklabels(["Non Diabetic","Diabetic"]);
for p in e.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    e.text(txt_x+0.1,txt_y+7,txt)
#proportion plot
sns.set(font_scale=1.7)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
diabetes = df.groupby(['Diabetes'])['Appointment_Missed'].value_counts(normalize=True)
diabetes = diabetes.mul(100)
diabetes = diabetes.rename('percent').reset_index()
diabetes_graph = sns.catplot(x='Diabetes',y='percent',hue='Appointment_Missed',kind='bar',data=diabetes,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
diabetes_graph.fig.set_size_inches(10,13)
diabetes_graph.ax.set_ylim(0,100)
diabetes_graph.set_xticklabels(["Non Diabetic","Diabetic"]);
for p in diabetes_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    diabetes_graph.ax.text(txt_x+0.1,txt_y+0.8,txt)
Looking at the proportion graph above, even if an equal number of samples were present for each category, the diabetes factor does not show more appointments being missed than attended for diabetic patients. Patients with diabetes do seem to attend a larger share of their appointments than patients without diabetes.
The Alcoholism factor indicates whether or not the patient has alcoholism and whether alcoholism increases the chance of missing an appointment. As shown below in the count plot, patients with alcoholism missed fewer appointments than patients without alcoholism.
#Alcoholism Factor Analyzed
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
alc = sns.countplot(x='Alcoholism', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
alc.set_xticklabels(["No Alcoholism","Yes Alcoholism"]);
for p in alc.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    alc.text(txt_x+0.1,txt_y+6.8,txt)
#proportion plot
sns.set(font_scale=1.7)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
alcoholism = df.groupby(['Alcoholism'])['Appointment_Missed'].value_counts(normalize=True)
alcoholism = alcoholism.mul(100)
alcoholism = alcoholism.rename('percent').reset_index()
alcoholism_graph = sns.catplot(x='Alcoholism',y='percent',hue='Appointment_Missed',kind='bar',data=alcoholism,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
alcoholism_graph.fig.set_size_inches(10,13)
alcoholism_graph.ax.set_ylim(0,100)
alcoholism_graph.set_xticklabels(["No Alcoholism","Yes Alcoholism"]);
for p in alcoholism_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    alcoholism_graph.ax.text(txt_x+0.1,txt_y+0.8,txt)
Looking at the proportion graph above, if an equal number of samples were present for each category, the alcoholism factor seems to have no effect, as patients with alcoholism and patients without alcoholism have similar proportions.
The Handicap factor indicates whether or not the patient is handicapped and whether being handicapped increases the chance of missing an appointment. As shown below in the count plot, patients who are handicapped missed fewer appointments than patients who are not handicapped.
df.Handicap.value_counts()
#Handicap Factor Analyzed
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
handicap_1 = sns.countplot(x='Handicap', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
handicap_1.set_xticklabels(["No Handicap","Handicapped"]);
for p in handicap_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    handicap_1.text(txt_x+0.1,txt_y+6.8,txt)
#proportion plot
sns.set(font_scale=1.7)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
handicap = df.groupby(['Handicap'])['Appointment_Missed'].value_counts(normalize=True)
handicap = handicap.mul(100)
handicap = handicap.rename('percent').reset_index()
handicap_graph = sns.catplot(x='Handicap',y='percent',hue='Appointment_Missed',kind='bar',data=handicap,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
handicap_graph.fig.set_size_inches(10,13)
handicap_graph.ax.set_ylim(0,100)
handicap_graph.set_xticklabels(["No Handicap","Handicapped"]);
for p in handicap_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    handicap_graph.ax.text(txt_x+0.1,txt_y+0.8,txt)
Looking at the proportion graph above, even if an equal number of samples were present for each category, the handicap factor does not show more appointments being missed than attended for handicapped patients. Patients who are handicapped do seem to attend a larger share of their appointments than non-handicapped patients.
The SMS Received factor indicates whether or not the patient received an SMS prior to the appointment and whether receiving an SMS decreases the number of missed appointments. As shown below in the count plot, the majority of patients did not receive an SMS; only 32% did. Patients who received an SMS missed fewer appointments than patients who received none.
df.SMS_Received.value_counts(normalize=True)
#SMS Received Factor Analyzed
plt.figure(figsize=(10,10))
sns.set(font_scale=1.5)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
sms_1 = sns.countplot(x='SMS_Received', data=df, hue='Appointment_Missed', palette=['#ffa600',"#007e6d"],edgecolor=(0,0,0), linewidth=2.5);
sms_1.set_xticklabels(["No SMS Received","Yes SMS Received"]);
for p in sms_1.patches:
    txt = str(p.get_height().round(2))
    txt_x = p.get_x()
    txt_y = p.get_height()
    sms_1.text(txt_x+0.1,txt_y+6.8,txt)
#proportion plot
sns.set(font_scale=1.7)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
sms_received = df.groupby(['SMS_Received'])['Appointment_Missed'].value_counts(normalize=True)
sms_received = sms_received.mul(100)
sms_received = sms_received.rename('percent').reset_index()
sms_received_graph = sns.catplot(x='SMS_Received',y='percent',hue='Appointment_Missed',kind='bar',data=sms_received,palette=['#ffa600',"#007e6d"], edgecolor=(0,0,0), linewidth=2,legend_out=False)
sms_received_graph.fig.set_size_inches(10,13)
sms_received_graph.ax.set_ylim(0,100)
sms_received_graph.set_xticklabels(["No SMS Received","Yes SMS Received"]);
for p in sms_received_graph.ax.patches:
    txt = str(p.get_height().round(1)) + '%'
    txt_x = p.get_x()
    txt_y = p.get_height()
    sms_received_graph.ax.text(txt_x+0.1,txt_y+0.8,txt)
Looking at the proportion plot above, which compares the categories as if each had an equal number of samples, patients who received an SMS missed a larger share of their appointments than those who did not receive one.
In this section, the factors that lead patients to miss their appointments will be analyzed. To carry out the analysis, the original dataframe was filtered for the patients who missed their appointment. A preview of the dataframe is shown below.
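The proportion tables behind these plots are built with the same groupby pattern for every factor. As a minimal sketch, the same numbers can be obtained with `pd.crosstab`. The helper name `missed_rate_by` and the toy data are assumptions made for illustration, as are the 'Yes'/'No' labels used for `Appointment_Missed`.

```python
import pandas as pd

def missed_rate_by(data, factor, outcome="Appointment_Missed"):
    """Percentage of each outcome value within every level of `factor`."""
    # normalize="index" makes every row (factor level) sum to 1
    table = pd.crosstab(data[factor], data[outcome], normalize="index")
    return table.mul(100).round(1)

# Toy data, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({
    "SMS_Received": [0, 0, 0, 0, 1, 1, 1, 1],
    "Appointment_Missed": ["No", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

rates = missed_rate_by(toy, "SMS_Received")
print(rates)
```

Each row of the result sums to 100, so the 'Yes' column can be compared directly across levels even when one level has far more samples than the other.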
import warnings
warnings.filterwarnings('ignore')
# create a list of our conditions
conditions = [
    (df_yes['Age'] <= 14),
    (df_yes['Age'] >= 15) & (df_yes['Age'] <= 64),
    (df_yes['Age'] >= 65)]
# create a list of the values we want to assign for each condition
values = ['Children_Adolescents', 'Adult', 'Senior']
# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default keeps the result dtype consistent with the string labels)
df_yes['Age_Group'] = np.select(conditions, values, default='Unknown')
# move the new column so that it sits right after 'Age'
df_yes = movecol(df_yes,
                 cols_to_move=['Age_Group'],
                 ref_col='Age',
                 place='After')
# display updated DataFrame
df_yes.head()
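The age grouping above can also be expressed with `pd.cut`, which assigns each value to a labelled interval. The sketch below is an alternative to the `np.select` approach, not the method used in this notebook, and the sample ages are made up for illustration.

```python
import pandas as pd

# Sample ages chosen to hit every boundary (illustrative values only)
ages = pd.Series([0, 14, 15, 40, 64, 65, 90], name="Age")

# Intervals are closed on the right by default: (-inf, 14], (14, 64], (64, inf)
bins = [-float("inf"), 14, 64, float("inf")]
labels = ["Children_Adolescents", "Adult", "Senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group.tolist())
```

For whole-number ages these intervals match the three conditions used above. The result is an ordered categorical, so plots keep the groups in age order without extra sorting.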
df.Appointment_Missed.value_counts(normalize=True)
Only 20% of the appointments were missed. Now let's investigate which factors stand out in helping us know whether a patient will miss their appointment or not.
df_yes['Gender'].value_counts(normalize=True)
Source Code used below: https://stackoverflow.com/questions/35692781/python-plotting-percentage-in-seaborn-bar-plot
#gender factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-.1, y), size = 20)
    plt.show()
plt.figure(figsize=(6,8))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
#sns.set_palette("colorblind")
gender_yes = sns.countplot(x='Gender', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
gender_yes.set_xticklabels(["Female","Male"]);
without_hue(gender_yes,df_yes.Gender)#'YlOrRd_r'
As shown in the plot above, females account for more of the missed appointments in this data: almost twice as many missed appointments belong to females as to males. Hence Gender is a strong indicator for a scheduled appointment to be missed.
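Counts taken from the missed-only dataframe also depend on how many appointments each group booked in the first place. One way to check a reading like the one above is to put a group's share of all appointments next to its share of missed appointments. The sketch below does this on made-up data; the helper name `share_comparison` and the 'Yes'/'No' outcome labels are assumptions for illustration.

```python
import pandas as pd

def share_comparison(data, factor, outcome="Appointment_Missed", missed_value="Yes"):
    """Share of each group among all rows vs. among missed rows, in percent."""
    overall = data[factor].value_counts(normalize=True)
    missed_rows = data.loc[data[outcome] == missed_value, factor]
    among_missed = missed_rows.value_counts(normalize=True)
    table = pd.DataFrame({
        "share_overall": overall,
        "share_among_missed": among_missed,
    }).fillna(0)
    return table.mul(100).round(1)

# Toy data, made up for illustration: twice as many F rows as M rows
toy = pd.DataFrame({
    "Gender": ["F"] * 6 + ["M"] * 3,
    "Appointment_Missed": ["Yes", "Yes", "No", "No", "No", "No",
                           "Yes", "No", "No"],
})

comparison = share_comparison(toy, "Gender")
print(comparison)
```

In this toy example females have twice as many missed appointments as males, yet both columns agree, because females also booked twice as many appointments. A gap between the two columns would suggest the factor carries information beyond group size.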
df_yes['Age_Group'].value_counts(normalize=True)
#Age group factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-.1, y), size = 20)
    plt.show()
plt.figure(figsize=(9,10))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
age_yes = sns.countplot(x='Age_Group', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
without_hue(age_yes,df_yes.Age_Group)#'YlOrRd_r'
As shown in the plot above, adults account for most of the missed appointments in this data: almost three times as many as children and almost 10 times as many as seniors. Hence Age Group is a strong indicator for a scheduled appointment to be missed.
This part analyzes the hour at which the appointment was booked and whether the hour factor affected the appointment being missed or not.
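The `without_hue` helper is defined again for every plot in this section, with only the label offset and font size changing. As a minimal sketch, a single helper that takes the axes as an argument could cover all of these cases. The name `annotate_percentages` and its parameters are hypothetical, and the bar values are made up for illustration.

```python
import matplotlib.pyplot as plt

def annotate_percentages(ax, total, horizontal=False, fontsize=12):
    """Label every bar on `ax` with its share of `total`; return the labels."""
    labels = []
    for p in ax.patches:
        # bar length is the width for horizontal bars, the height otherwise
        size = p.get_width() if horizontal else p.get_height()
        label = '{:.1f}%'.format(100 * size / total)
        if horizontal:
            xy = (p.get_x() + p.get_width(), p.get_y() + p.get_height() / 2)
            ax.annotate(label, xy, va='center', fontsize=fontsize)
        else:
            xy = (p.get_x() + p.get_width() / 2, p.get_y() + p.get_height())
            ax.annotate(label, xy, ha='center', va='bottom', fontsize=fontsize)
        labels.append(label)
    return labels

# Illustrative bars: 6 and 3 rows out of 9
fig, ax = plt.subplots()
ax.bar(["F", "M"], [6, 3])
labels = annotate_percentages(ax, total=9)
print(labels)
```

Because the axes object is passed in, the same function works for any of the count plots here, for example by passing the object returned by `sns.countplot` together with `len(df_yes)`.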
#scheduled_hour factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-.2, y), size = 20)
    plt.show()
plt.figure(figsize=(28,14))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
scheduled_hour_yes = sns.countplot(x='Scheduled_Hour', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
without_hue(scheduled_hour_yes,df_yes.Scheduled_Hour)
Analyzing the above plot, the majority of missed appointments occurred when the appointment was booked between 7 and 10 am or at 2 pm. The fewest missed appointments occurred when the appointment was booked at 6 am or between 5 and 9 pm.
To conclude, the scheduled hour factor is a good indicator to show if a patient will show up for their scheduled appointment or not.
Breaking down the scheduled date factor further, I analyzed the weekday on which the appointment was booked and whether this factor affected the appointment being missed or not.
#scheduled_weekday factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-.2, y), size = 18)
    plt.show()
plt.figure(figsize=(10,10))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
scheduled_weekday_yes = sns.countplot(x='Scheduled_Weekday', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
scheduled_weekday_yes.set_xticklabels(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
without_hue(scheduled_weekday_yes,df_yes.Scheduled_Weekday)#'YlOrRd_r'
Among the weekdays, Tuesday has the highest number of missed appointments, followed by Wednesday and Monday, whereas Saturday saw the fewest appointments, both attended and missed. The Scheduled Weekday factor is not a strong indicator for appointments to be missed.
For the appointment date factor only the date was available, so this factor was broken down into 2 subfactors: appointment month and weekday.
In this subfactor, the month for which the appointment was booked is analyzed, including whether the month of the appointment had an impact on the appointment being missed or not.
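The weekday plots label their bars by hand, which only works if the bars happen to be drawn in Monday-to-Saturday order. As a hedged sketch, weekday names can instead be derived from the dates themselves and kept in calendar order with an ordered categorical. The dates below are made up for illustration, and how `Scheduled_Weekday` is encoded in this notebook is not assumed here.

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Illustrative dates: one Friday, two Tuesdays, one Saturday
dates = pd.to_datetime(pd.Series(
    ["2016-04-29", "2016-05-03", "2016-05-03", "2016-05-07"]))

# day_name() returns English names by default; the categorical fixes the order
weekday = pd.Series(pd.Categorical(dates.dt.day_name(),
                                   categories=days, ordered=True))
counts = weekday.value_counts(sort=False)
print(counts)
```

Plotting a column built this way lets the tick labels come from the data, so a missing weekday cannot shift the labels onto the wrong bars.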
#Appointment Month factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-.1, y), size = 18)
    plt.show()
plt.figure(figsize=(8,7))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
appointment_month_yes = sns.countplot(x='Appointment_Month', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
appointment_month_yes.set_xticklabels(["April","May","June"]);
without_hue(appointment_month_yes,df_yes.Appointment_Month)
Analyzing the plot above, May had by far the most missed appointments, almost 26 times as many as April and 3 times as many as June. Based on this, the appointment month factor could be a good predictor of whether a patient will show up for their scheduled appointment, but we need data from other months to make an accurate analysis.
In this subfactor, the weekday for which the appointment was booked is analyzed, including whether the weekday that the appointment falls on had an impact on the appointment being missed or not.
#Appointment Weekday factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-.2, y), size = 18)
    plt.show()
plt.figure(figsize=(11,8))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
appointment_weekday_yes = sns.countplot(x='Appointment_Weekday', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
appointment_weekday_yes.set_xticklabels(["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]);
without_hue(appointment_weekday_yes,df_yes.Appointment_Weekday)#'YlOrRd_r'
Among the weekdays, Tuesday has the highest number of missed appointments, followed by Wednesday and Monday, whereas Saturday has the lowest. The Appointment Weekday factor is not a strong indicator for appointments to be missed.
In the Neighbourhood factor analysis, the plot below shows how missed appointments are distributed across neighbourhoods. As shown below, JARDIM CAMBURI has the highest number of missed appointments, followed by MARIA ORTIZ. In contrast, there were no missed appointments in the ILHA DO BOI, ILHA DO FRADE and AEROPORTO neighbourhoods.
df_yes['Neighbourhood'].value_counts(normalize=True).sort_values(ascending=False)
#Neighbourhood factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        # bars are horizontal here, so the bar length is its width
        percentage = '{:.1f}%'.format(100 * p.get_width()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x+4, y), size = 50)
    plt.show()
descending_order = df['Neighbourhood'].value_counts().sort_values(ascending=False).index
plt.figure(figsize=(110,140))
sns.set(font_scale=4)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
neighbourhood_yes =sns.countplot(y='Neighbourhood', data=df_yes,palette=['#ffa600'],edgecolor=(0,0,0), linewidth=1, order=descending_order);
without_hue(neighbourhood_yes,df_yes.Neighbourhood)#'YlOrRd_r'
In conclusion, the neighbourhood factor could be a good predictor, provided we have more detail about the socio-economic conditions so that a more accurate analysis can be carried out.
The Scholarship factor indicates whether or not the patient is enrolled in the Brazilian welfare program called Bolsa Família and how this contributed to appointments being missed. As shown below in the plot, patients with a scholarship had about 8 times fewer missed appointments than patients without one.
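Neighbourhoods differ widely in how many appointments they book, so a per-neighbourhood summary that shows total appointments, missed appointments and the missed rate side by side may help when reading the plot. The sketch below uses made-up data; the helper name `neighbourhood_rates`, the `min_appointments` cutoff and the 'Yes'/'No' outcome labels are assumptions for illustration.

```python
import pandas as pd

def neighbourhood_rates(data, min_appointments=2,
                        outcome="Appointment_Missed", missed_value="Yes"):
    """Appointments, missed count and missed rate per neighbourhood."""
    missed = data[outcome].eq(missed_value)
    summary = (
        data.assign(missed=missed)
            .groupby("Neighbourhood")["missed"]
            .agg(appointments="count", missed="sum", missed_rate="mean")
    )
    # drop neighbourhoods with too few appointments to give a stable rate
    summary = summary[summary["appointments"] >= min_appointments]
    return summary.sort_values("missed_rate", ascending=False)

# Toy data, made up for illustration (not real neighbourhoods)
toy = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "A", "B", "B", "C"],
    "Appointment_Missed": ["Yes", "No", "No", "No", "Yes", "Yes", "Yes"],
})

summary = neighbourhood_rates(toy)
print(summary)
```

Neighbourhood C is dropped because it has a single appointment; the cutoff is a judgment call and would need to be much larger on the real data.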
#Scholarship factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-.12, y), size = 18)
    plt.show()
plt.figure(figsize=(6,7))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
scholarship_missed = sns.countplot(x='Scholarship', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
scholarship_missed.set_xticklabels(["No Scholarship","Scholarship: Granted"]);
without_hue(scholarship_missed,df_yes.Scholarship)#'YlOrRd_r'
In conclusion, the scholarship factor is a very strong indicator for appointments to be missed.
The Hypertension factor indicates whether or not the patient has hypertension and how this contributed to appointments being missed. As shown below in the plot, patients without hypertension had about 5 times as many missed appointments as patients with hypertension.
#Hypertension factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-.05, y), size = 18)
    plt.show()
plt.figure(figsize=(8,9))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
hypertension_missed = sns.countplot(x='Hypertension', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
hypertension_missed.set_xticklabels(["Hypertension Patient: No","Hypertension Patient: Yes"]);
without_hue(hypertension_missed,df_yes.Hypertension)#'YlOrRd_r'
In conclusion, the hypertension factor is a very strong indicator for appointments to be missed.
The Diabetes factor indicates whether or not the patient has diabetes and how this contributed to appointments being missed. As shown below in the plot, patients without diabetes had about 15 times as many missed appointments as patients with diabetes.
#Diabetes factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-0.09, y), size = 18)
    plt.show()
plt.figure(figsize=(6,8))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
diabetes_missed = sns.countplot(x='Diabetes', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
diabetes_missed.set_xticklabels(["Non Diabetic","Diabetic"]);
without_hue(diabetes_missed,df_yes.Diabetes)#'YlOrRd_r'
In conclusion, the diabetes factor is a very strong indicator for appointments to be missed.
The Alcoholism factor indicates whether or not the patient has alcoholism and how this contributed to appointments being missed. As shown below in the plot, patients without alcoholism had about 32 times as many missed appointments as patients with alcoholism.
df_yes.Alcoholism.value_counts(normalize=True)
#Alcoholism factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-0.09, y), size = 18)
    plt.show()
plt.figure(figsize=(6,8))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
alcoholism_missed = sns.countplot(x='Alcoholism', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
alcoholism_missed.set_xticklabels(["No Alcoholism","Yes Alcoholism"]);
without_hue(alcoholism_missed,df_yes.Alcoholism)#'YlOrRd_r'
In conclusion, the alcoholism factor is a very strong indicator for appointments to be missed.
The Handicap factor indicates whether or not the patient is handicapped and how this contributed to appointments being missed. As shown below in the plot, patients who are not handicapped had about 61 times as many missed appointments as patients who are handicapped.
df_yes.Handicap.value_counts(normalize=True).round(3)
#Handicap factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-0.09, y), size = 18)
    plt.show()
plt.figure(figsize=(6,8))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
handicap_missed = sns.countplot(x='Handicap', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
handicap_missed.set_xticklabels(["No Handicap","Handicapped"]);
without_hue(handicap_missed,df_yes.Handicap)#'YlOrRd_r'
In conclusion, the handicap factor is a very strong indicator for appointments to be missed.
The SMS Received factor indicates whether or not the patient received an SMS prior to the appointment and how receiving an SMS contributed to appointments being missed or not. As shown in the plot below, patients who received an SMS missed around the same number of appointments as patients who didn't receive one.
df_yes.SMS_Received.value_counts(normalize=True).round(3)
#SMS factor
def without_hue(plot, feature):
    # annotate each bar of `plot` with its share of all rows in `feature`
    total = len(feature)
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height()/total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plot.annotate(percentage, (x-0.09, y), size = 18)
    plt.show()
plt.figure(figsize=(6,8))
sns.set(font_scale=1.55)
sns.set_style("darkgrid", {"axes.facecolor": ".9"})
sms_missed = sns.countplot(x='SMS_Received', data=df_yes,palette='gist_heat_r',edgecolor=(1,0,0), linewidth=1);
sms_missed.set_xticklabels(["No SMS Received","SMS Received"]);
without_hue(sms_missed,df_yes.SMS_Received)
In conclusion, the SMS Received factor is not a very strong indicator for appointments to be missed.
More data is needed to make the outcomes more credible. In addition, the neighbourhood factor analysis needs more information on socio-economic conditions. Lastly, more characteristics about patients' medical history and other personal information would help in reaching a more concrete conclusion.